(54) Continuous speech recognition of text and commands



(57) In a method for use in recognizing continuous
speech, signals are accepted corresponding to interspersed
speech elements including text elements corresponding
to text to be recognized and command elements
corresponding to commands to be executed. The
elements are recognized. The recognized elements are
acted on in a manner which depends on whether they
represent text or commands.
Description

Background 

This invention relates to continuous speech recog- 
nition. 

Many speech recognition systems recognize spo- 
ken text in one mode and spoken commands in another 
mode. In one example, the dictation mode requires dis- 
crete speech while the command mode may be handled 
by continuous/discrete speech. In dictation mode, a us- 
er's discrete speech is recognized as, e.g., English 
words, and the recognized words are displayed to the 
user. The user may dictate any word that is within a vo- 
cabulary held in the system without having to follow any 
particular structure. This is called "free context" discrete 
speech. In command mode, the system recognizes ei- 
ther continuous or discrete speech and executes the 
commands. For example, if the user says "underline last 
three words," the system recognizes the command and 
then underlines the last three words that the user spoke 
in dictation mode. The user speaks commands as struc- 
tured speech in accordance with a particular structure 
or template. For example, the user may say "underline 
last three words" but not "underline the last three words" 
or "please underline last three words." The user switch- 
es between command mode and dictation mode by 
speaking "Command Mode", double clicking on an icon 
representing the mode the user wants to switch into, or 
typing a switch mode command. 

Summary 

In general, in one aspect, the invention features a 
method for use in recognizing continuous speech. Sig- 
nals are accepted corresponding to interspersed 
speech elements including text elements corresponding 
to text to be recognized and command elements corre- 
sponding to commands to be executed. The elements 
are recognized. The recognized elements are acted on 
in a manner which depends on whether they represent 
text or commands. 

Implementations of the invention may include one 
or more of the following. The text may be acted on by 
providing it to a text processing application. The com- 
mands may be acted upon by causing an application to 
perform a step. The recognizing may be based on nat- 
ural characteristics of spoken text versus spoken com- 
mands. The recognizing may include evaluating the like- 
lihood that a given element is either a command element 
or a text element. The recognizing may be biased in fa- 
vor of a given element being text or a command. The 
biasing may include determining if a given one of the 
elements reflects a command reject or conforms to a 
command template; or comparing recognition scores of 
the given element as a command or as text; or deter- 
mining the length of silence between successive ones 
of the elements or whether the actions of the user imply 
that a given one of the elements cannot be text.

The recognizing may include, in parallel, recogniz- 
ing the elements as if they were text, and recognizing 
the elements as if they were commands. The recognizing
of elements as if they were text (or commands) may
be temporarily stopped upon determining that an ele- 
ment is a command element (or a text element). The 
results of the recognition may be displayed to a user. 
The results may be partial results. The user may be
enabled to cause a re-recognition if the element is
incorrectly recognized as text or a command. The user may
cause a re-recognition if a command element is recognized
as a text element, and in response to the re-recognition
a text processing application may undo the inclusion
of the text element in text being processed. Prior
to acting on a recognized command element, informa- 
tion associated with the command element may be dis- 
played to a user; a direction may be accepted from the 
user to consider previous or subsequent elements as 
either text or commands but not both.

The advantages of the invention may include one 
or more of the following. Recognizing spoken com- 
mands within dictated text allows users to intermittently 
execute commands that affect the text (e.g., underlining 
or bolding particular words) without requiring the user to
switch between separate command and dictation 
modes. Moreover, user confusion is reduced because 
the user is not required to remember which mode the 
system is in. 

Other advantages and features will become apparent
from the following description and from the claims.

Description 

Fig. 1 is a block diagram of a speech recognition
system. 

Fig. 2 is a block diagram of speech recognition soft- 
ware and application software. 

Fig. 3 is a block diagram of speech recognition
software and vocabularies stored in memory.

Fig. 4 is a flow chart of recognizing both commands 
and dictated text. 

Fig. 5 is a computer screen display of word processing
commands.

Fig. 6 is a computer screen display of examples of
word processing commands.

Fig. 7 is a block diagram of word processing com- 
mands. 

Figs. 8a, 8b, 9a, and 9b are computer screen displays
of partial results and command execution results.

Fig. 10 is another flow chart of recognizing both
commands and dictated text.

Fig. 11 is a block diagram of speech recognition
software and vocabularies stored in memory.

Fig. 12 is a flow chart of structured continuous
command speech recognition.

Fig. 13 is a block diagram of spreadsheet commands.

Fig. 14 is another flow chart of recognizing both
commands and dictated text.

Figs. 15a-15d are computer screen displays depicting
the process of correcting a misrecognized command.

The system recognizes both continuously spoken 
commands and continuously dictated text by taking ad- 
vantage of characteristics common to the natural 
speech of most users. For instance, users typically 
pause (e.g., 0.5 sec) before and after speaking a com- 
mand. Similarly, following a pause, users begin com- 
mands by speaking action verbs (e.g., underline, bold, 
delete) and begin dictated text by speaking nouns. To 
take advantage of these and other characteristics, the 
system expects the user to pause before and after 
speaking a command and to follow a particular structure 
or template when speaking a command (e.g., all com- 
mands begin with action verbs). These requirements im- 
prove the accuracy with which the system distinguishes 
between dictated text and commands. 

Referring to Fig. 1, a typical speech recognition system
10 includes a microphone 12 for converting a user's
speech into an analog data signal 14 and a sound card
16. The sound card includes a digital signal processor
(DSP) 19 and an analog-to-digital (A/D) converter 17 for
converting the analog data signal into a digital data
signal 18 by sampling the analog data signal at about 11
kHz to generate 220 digital samples during a 20 msec
time period. Each 20 ms time period corresponds to a 
separate speech frame. The DSP processes the sam- 
ples corresponding to each speech frame to generate a 
group of parameters associated with the analog data 
signal during the 20 ms period. Generally the parame- 
ters represent the amplitude of the speech at each of a 
set of frequency bands. 
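
The framing arithmetic above can be checked with a short sketch. The 11 kHz rate and the 20 ms frame come from the description; the function names and the choice to drop a trailing partial frame are assumptions made only for illustration and are not details of the described DSP.

```python
SAMPLE_RATE_HZ = 11_000  # "about 11 kHz" per the description
FRAME_MS = 20            # one speech frame per 20 ms


def samples_per_frame(rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """Number of digital samples that fall inside one speech frame."""
    return rate_hz * frame_ms // 1000


def split_into_frames(samples, rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """Group a flat list of samples into consecutive full frames.

    Assumption: a trailing partial frame is dropped.
    """
    n = samples_per_frame(rate_hz, frame_ms)
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]
```

With the stated figures, `samples_per_frame()` gives the 220 samples per frame mentioned in the text.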

The DSP also monitors the volume of the speech 
frames to detect user utterances. If the volume of three 
consecutive speech frames within a window of five con- 
secutive speech frames exceeds a predetermined 
speech threshold, for example, 20 dB, then the DSP de- 
termines that the analog signal represents speech and 
the DSP begins sending a batch of, e.g., three, speech 
frames of data at a time via a digital data signal 23 to a 
central processing unit (CPU) 20. The DSP asserts an 
utterance signal (Utt) 22 to notify the CPU each time a 
batch of speech frames representing an utterance is 
sent via the digital data signal. 

When an interrupt handler 24 on the CPU receives 
assertions of Utt signal 22, the CPU's normal sequence 
of execution is interrupted. Interrupt signal 26 causes 
operating system software 28 to call a store routine 29. 
Store routine 29 stores the incoming batch of speech 
frames into a buffer 30. When fourteen consecutive 
speech frames within a window of nineteen consecutive 
speech frames fall below a predetermined silence 
threshold, e.g., 6 dB, then the DSP stops sending 
speech frames to the CPU and asserts an End_Utt signal 21.
The End_Utt signal causes the store routine to
organize the batches of previously stored speech 
frames into a speech packet 39 corresponding to the 
user utterance. 
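
The start and end conditions described above can be sketched as a small state machine over per-frame volumes. The 20 dB and 6 dB thresholds and the 3-of-5 and 14-of-19 windows come from the description. The function name, the reading of "three frames within a window of five" as "at least three of the last five", and the choice to report the frame at which each condition is first met are assumptions of this sketch.

```python
from collections import deque


def detect_utterances(frame_volumes_db, speech_db=20.0, silence_db=6.0):
    """Return (start, end) frame-index pairs for detected utterances.

    Start: at least 3 of the last 5 frames exceed speech_db.
    End:   at least 14 of the last 19 frames fall below silence_db.
    Indices mark the frame at which each condition was first met.
    """
    utterances = []
    recent = deque(maxlen=19)  # sliding window of the latest volumes
    in_speech = False
    start = None
    for i, volume in enumerate(frame_volumes_db):
        recent.append(volume)
        if not in_speech:
            last_five = list(recent)[-5:]
            if sum(v > speech_db for v in last_five) >= 3:
                in_speech = True
                start = i
                recent.clear()
        elif sum(v < silence_db for v in recent) >= 14:
            utterances.append((start, i))
            in_speech = False
            recent.clear()
    return utterances
```

In the described system the frames between the two events would be batched to the CPU and assembled into a speech packet; the sketch only locates the boundaries.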

Interrupt signal 26 also causes the operating sys- 
tem software to call monitor software 32. Monitor soft- 
ware 32 keeps a count 34 of the number of speech pack- 
ets stored but not yet processed. An application 36, for 
example, a word processor, being executed by the CPU 
periodically checks for user input by examining the mon- 
itor software's count. If the count is zero, then there is 
no user input. If the count is not zero, then the applica- 
tion calls speech recognizer software 38 and passes a 
pointer 37 to the address location of the speech packet 
in buffer 30. The speech recognizer may be called di- 
rectly by the application or may be called on behalf of 
the application by a separate program, such as Dragon- 
Dictate™ from Dragon Systems™ of West Newton, 
Massachusetts, in response to the application's request 
for input from the mouse or keyboard. 

For a more detailed description of how user utter- 
ances are received and stored within a speech recogni- 
tion system, see United States Patent No. 5,027,406, 
entitled "Method for Interactive Speech Recognition and 
Training" which is incorporated by reference. 

Referring to Fig. 2, to determine what words have
been spoken, speech recognition software 38 causes
the CPU to retrieve speech frames within speech packet 
39 from buffer 30 and compare the speech frames to 
speech models stored in one or more vocabularies 40. 
For a more detailed description of continuous speech 
recognition, see United States Patent No. 5,202,952, 
entitled "Large-Vocabulary Continuous Speech Prefil- 
tering and Processing System", which is incorporated 
by reference. 

The recognition software uses common script lan- 
guage interpreter software to communicate with the ap- 
plication 36 that called the recognition software. The 
common script language interpreter software enables 
the user to dictate directly to the application either by 
emulating the computer keyboard and converting the 
recognition results into application dependent key- 
strokes or by sending application dependent commands 
directly to the application using the system's application 
communication mechanism (e.g., Microsoft Windows™ 
uses Dynamic Data Exchange™). The desired applica- 
tions include, for example, word processors 44 (e.g., 
Word Perfect™ or Microsoft Word™), spreadsheets 46 
(e.g., Lotus 1-2-3™ or Excel™), and games 48 (e.g., 
solitaire™). 

As an alternative to dictating directly to an applica- 
tion, the user dictates text to a speech recognizer win- 
dow, and after dictating a document, the user transfers 
the document (manually or automatically) to the appli- 
cation. 

Referring to Fig. 3, when an application first calls
the speech recognition software, it is loaded from a disk 
drive into the computer's local memory 42. One or more 
vocabularies, for example, common vocabulary 48 and 
Microsoft Office™ vocabulary 50, are also loaded from
remote storage into memory 42. The vocabularies 48,
50, and 54 include all the words 48b, 50b, and 54b (text
and commands), and corresponding speech models 
48a, 50a, and 54a, that a user may speak. 

Spreading the speech models and words across dif- 
ferent vocabularies allows the speech models and 
words to be grouped into vendor (e.g., Microsoft™ and 
Novell™) dependent vocabularies which are only load- 
ed into memory when an application corresponding to 
a particular vendor is executed for the first time after 
power-up. For example, many of the speech models and 
words in the Novell PerfectOffice™ vocabulary 54 rep- 
resent words only spoken when a user is executing a 
Novell PerfectOffice™ application, e.g., WordPer- 
fect™. As a result, these speech models and words are 
only needed when the user executes a Novell™ application.
To avoid wasting valuable memory space, the
Novell PerfectOffice™ vocabulary 54 is only loaded into 
memory when needed (i.e., when the user executes a 
Novell™ application). 

Alternatively, the speech models and words may be 
grouped into application dependent vocabularies. For 
example, separate vocabularies may exist for Microsoft 
Word™, Microsoft Excel™, and Novell WordPerfect™ . 
As another alternative, only a single vocabulary includ- 
ing all words, and corresponding speech models, that a 
user may speak is loaded into local memory and used 
by the speech recognition software to recognize a user's 
speech. 

Referring to Fig. 4, once the vocabularies are stored
in local memory, an application calls the recognition
software. In one method, the CPU compares speech frames
representing the user's speech to speech models in the 
vocabularies to recognize (step 60) the user's speech. 
The CPU then determines (steps 62 and 64) whether 
the results represent a command or text. Commands in- 
clude single words and phrases and sentences that are 
defined by templates (i.e., restriction rules). The tem- 
plates define the words that may be said within com- 
mand sentences and the order in which the words are 
spoken. The CPU compares (step 62) the recognition 
results to the possible command words and phrases and 
to command templates, and if the results match a com- 
mand word or phrase or a command template (step 64), 
then the CPU sends (step 65a) the application that 
called the speech recognition software keystrokes or 
scripting language that cause the application to execute 
the command, and if the results do not match a com- 
mand word or phrase or a command template, the CPU 
sends (step 65b) the application keystrokes or scripting 
language that cause the application to type the results 
as text. 
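
The decision of steps 62 through 65b can be sketched as follows. The command words and phrases are taken from examples elsewhere in this description; the single regular-expression template, the function names, and the callback parameters are hypothetical stand-ins for the stored templates and for the keystroke or scripting interface.

```python
import re

# Single command words and phrases (examples from this description).
COMMAND_PHRASES = {"bold", "center", "close document", "cut this paragraph"}

# Hypothetical template standing in for the stored restriction rules.
COMMAND_TEMPLATES = [
    re.compile(r"underline last (two|three|four) words"),
]


def classify(result):
    """Steps 62 and 64: decide whether a recognition result is a command."""
    normalized = " ".join(result.lower().split())
    if normalized in COMMAND_PHRASES:
        return "command"
    if any(t.fullmatch(normalized) for t in COMMAND_TEMPLATES):
        return "command"
    return "text"


def dispatch(result, execute_command, type_text):
    """Steps 65a and 65b: route the result to the calling application."""
    if classify(result) == "command":
        execute_command(result)  # step 65a
    else:
        type_text(result)        # step 65b
```

Because a template must match the whole utterance, "please underline last three words" falls through to text, which mirrors the structured-speech requirement described in the Background.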

Referring to Fig. 5, while dictating text, the user may 
cause the computer to display a command browser 66 
by keystroke, mouse selection, or utterance (e.g., 
speaking the phrase "What Can I Say" 68 into the
microphone). The command browser displays possible
commands for the application being executed. For
example, a word processing application includes single
command words, e.g., [Bold] 70 and [Center] 72,
command phrases, e.g., [Close Document] 74 and [Cut This
Paragraph] 76, and flexible sentence commands, e.g.,
[<Action> <2 to 20> <Text Objects>] 78 and [Move
<Direction> <2 to 20> <Text Objects>] 80. Referring also to
Fig. 6, the user may select a command shown in the
command browser to display examples 82 of the selected
command 80.

Referring to Fig. 7, the command sentences, e.g.,
78, 80, 84, and 88, are spoken in accordance with a
template and without long, e.g., greater than 0.5 second,
pauses between the words of the sentence. (The length
of the pause may be adjusted to compensate for a
particular user's speech impediment.) For example,
command 80 requires the user to speak the fixed word
"Move" 88 followed by a direction variable 90 (i.e.,
<Direction>: "Up", "Down", "Left", "Right", "Back", or
"Forward"), a number variable 92 (i.e., <2 to 20>: "2", "3",
"4", ... or "20"), and, optionally (dashed line 94), a plural
text object variable 96 (i.e., <Text Objects>: "Characters",
"Words", "Lines", "Sentences", or "Paragraphs").
If the user wants to move up two lines in previously
dictated text, the user says "Move Up 2 Lines". The user
may not say "Move Up 2", "Please Move Up 2 Lines", 
or "Move Up Last 2 Lines" because this speech does 
not follow the template for Move command 80. 
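
A template such as Move command 80 can be sketched as a pattern assembled from its variable groups. The word lists come from the description. The regular-expression encoding and the names below are assumptions. The sketch treats the text object as required, which matches the rejected example "Move Up 2"; the description also marks that variable as optional, so this choice is an assumption rather than a statement of the actual template.

```python
import re

DIRECTIONS = ["Up", "Down", "Left", "Right", "Back", "Forward"]
NUMBERS = [str(n) for n in range(2, 21)]  # <2 to 20>
TEXT_OBJECTS = ["Characters", "Words", "Lines", "Sentences", "Paragraphs"]


def alternation(words):
    """Build a non-capturing group that matches any one of the words."""
    return "(?:" + "|".join(re.escape(w) for w in words) + ")"


MOVE_TEMPLATE = re.compile(
    rf"Move (?P<direction>{alternation(DIRECTIONS)}) "
    rf"(?P<count>{alternation(NUMBERS)}) "
    rf"(?P<unit>{alternation(TEXT_OBJECTS)})",
    re.IGNORECASE,
)


def parse_move(utterance):
    """Return (direction, count, unit) or None if the template is not met."""
    match = MOVE_TEMPLATE.fullmatch(utterance.strip())
    if match is None:
        return None
    return (
        match.group("direction").capitalize(),
        int(match.group("count")),
        match.group("unit").capitalize(),
    )
```

Requiring a full match is what rejects leading or inserted words such as "Please" or "Last".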

Referring back to Fig. 3, in addition to including
words (and phrases) and corresponding speech models,
the vocabularies include application (e.g., Microsoft
Word™ 100 and Microsoft Excel™ 102) dependent
command sentences 48c, 50c, and 54c available to the
user and application dependent groups 48d, 50d, and
54d which are pointed to by the sentences and which
point to groups of variable words in the command
templates.

Aside from pointing to groups of variable words, the 
groups define the application dependent keystrokes (or 
scripting language) for each word that may be spoken.
For example, when the user speaks a command sen- 
tence beginning with "Capitalize" while executing Micro- 
soft Word™, the action group points to the word "Cap- 
italize" and provides the following keystrokes: 


{Alt+O}et{Enter}.

When executing Novell WordPerfect™, the action 
group also points to the word "Capitalize" but provides
the following keystrokes: 

{Alt+e}vi{Right0}. 


Each command sentence in the loaded vocabularies
48, 50, and 54 includes pointers to the different
components of the sentence. For example, command
sentence 102 includes a pointer to the fixed word Move 1 78
(and its corresponding speech model) and pointers to 
the groups, e.g., <Direction> 120, <2 to 20> 122, and 
<Text Objects> 124. The groups include pointers to the 
words in the groups (and the corresponding speech 
models), e.g., direction words 126, numbers 128, and 
text object words 130. 

The pointers allow components of each sentence 
to be spread across several stored vocabularies and 
shared by the sentences of different vocabularies. For 
example, the command sentence 136 ([Print Pages 
<Number/1 to 99> to <Number/1 to 99>], Fig. 5) is stored
in both the Microsoft Office™ vocabulary 50 (not shown) 
and the Novell PerfectOffice™ vocabulary 54 (not 
shown), while the speech models and words (i.e., num- 
bers 1 to 99) are stored in Number vocabulary 138. To 
allow for "cross-vocabulary recognition", the pointers in 
vocabularies 48, 50, and 54 reference by name the vo- 
cabulary in which the words can be found. For example, 
the variable words 1, 2, 3, ... 99 may be found in the 
Number vocabulary (e.g., <Number/1 to 99>). Once the 
vocabularies are copied into local memory, the named 
references are resolved and replaced with actual ad- 
dress locations of the words within the local memory. 
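
The named-reference step can be sketched with plain dictionaries. The vocabulary names, the [Print Pages ...] sentence, and the <Number/1 to 99> notation come from the description. The dictionary layout and the function names are hypothetical, and Python object references stand in for the memory addresses that replace the names.

```python
def build_example():
    """Three vocabularies; two sentences refer to the Number vocabulary by name."""
    number_words = [str(n) for n in range(1, 100)]
    sentence = ["Print", "Pages", "<Number/1 to 99>", "to", "<Number/1 to 99>"]
    return {
        "Number": {"groups": {"1 to 99": number_words}},
        "Microsoft Office": {"sentences": [list(sentence)]},
        "Novell PerfectOffice": {"sentences": [list(sentence)]},
    }


def resolve_references(vocabularies):
    """Replace each '<Vocabulary/Group>' name with the referenced word list."""
    for vocab in vocabularies.values():
        for sentence in vocab.get("sentences", []):
            for i, part in enumerate(sentence):
                if part.startswith("<") and part.endswith(">"):
                    vocab_name, group_name = part[1:-1].split("/", 1)
                    sentence[i] = vocabularies[vocab_name]["groups"][group_name]
    return vocabularies
```

After resolution both sentences point at the same list object, so a word added to the group in the Number vocabulary is visible to every sentence that references it.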

Through cross-vocabulary recognition, a word may 
be added to a variable group of words (e.g., <1 to 99>) 
in only one vocabulary instead of to each vocabulary in- 
cluding the group. Additionally, the variable group of 
words is not repeated across several vocabularies. 

While a user's speech is being recognized, the CPU 
sends keystrokes or scripting language to the application
to cause the application to display partial results (i.e.,
recognized words within an utterance before the entire
utterance has been considered) within the document
being displayed on the display screen (or in a status win- 
dow on the display screen). If the CPU determines that 
the user's speech is text and the partial results match 
the final results, then the CPU is finished. However, if 
the CPU determines that the user's speech is text but 
that the partial results do not match the final results, then 
the CPU sends keystrokes or scripting language to the 
application to correct the displayed text. Similarly, if the 
CPU determines that the user's speech was a com- 
mand, then the CPU sends keystrokes or scripting lan- 
guage to the application to cause the application to de- 
lete the partial results from the screen and execute the 
command. 
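
One way to picture the correction of partial results is as backspaces followed by replacement characters. The description states only that keystrokes or scripting language are sent to correct or delete the displayed text; the backspace encoding, the common-prefix strategy, and the function names below are assumptions of this sketch.

```python
BACKSPACE = "\b"


def correction_keystrokes(partial, final):
    """Keystrokes that turn the displayed partial text into the final text."""
    common = 0
    for a, b in zip(partial, final):
        if a != b:
            break
        common += 1
    # Erase everything after the shared prefix, then type the remainder.
    return BACKSPACE * (len(partial) - common) + final[common:]


def keystrokes_for_command(partial):
    """Erase the partial results entirely before a command is executed."""
    return BACKSPACE * len(partial)


def apply_keystrokes(displayed, keys):
    """Simulate an application receiving the keystrokes."""
    for key in keys:
        displayed = displayed[:-1] if key == BACKSPACE else displayed + key
    return displayed
```

When the partial and final results already match, `correction_keystrokes` returns an empty string, which corresponds to the case in which the CPU is finished.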

For example, the application being executed by the 
system is a meeting scheduler (Figs. 8a, 8b, 9a, and 
9b). After the system displays partial results 302 "sched- 
ule this meeting in room 507" (Fig. 8a), the system de- 
termines that the utterance was a command and re- 
moves the text from the display screen (Fig. 8b) and ex- 
ecutes the command by scheduling 304 the meeting in 
room 507. Similarly, after the system displays partial re- 
sults 304 "underline last three words" (Fig. 9a), the sys- 
tem determines that the utterance was a command and 
removes the text from the display screen (Fig. 9b) and
executes the command by underlining 306 the last three
words. 

The partial results allow the user to see how the rec- 
ognition is proceeding. If the speech recognition is not 
accurate, the user can stop speaking and proceed by
speaking more slowly or clearly or the user or a techni- 
cian can use the partial results information to diagnose 
speech recognition system errors. 

One difficulty with recognizing both commands and 
text against the same set (i.e., one or more) of vocabu- 
laries is that language modeling information in the vo- 
cabularies may cause the CPU to recognize a user's 
spoken command as text rather than as a command. 
Typically, the speech models for dictated words include 
language modeling information about the way a user 
naturally speaks a given language. For example, the 
word "bold" is generally followed by a noun, e.g., "That 
was a bold presentation." On the other hand, command 
sentences are purposefully stilted or unnatural (e.g., be- 
ginning with action verbs instead of nouns) to distinguish 
them from text and improve speech recognition accura- 
cy. For example, the command "bold" is generally fol- 
lowed by a direction (e.g., next, last), a number (e.g., 2, 
3, 4), or a text object (e.g., character, paragraph), e.g., 
"Bold last paragraph." When a user's speech is recog- 
nized for commands and text against the same set of 
vocabularies, any language modeling information in the 
vocabularies tends to cause the system to favor the rec- 
ognition of text over commands. 

Referring to Figs. 10 and 11, one alternative is to
execute two serial recognitions. The CPU begins by
recognizing (step 140, Fig. 10) the user's speech using one
or more dictated word vocabularies 150 (Fig. 11) including
words (and corresponding speech models) that a user
may say while dictating text. This vocabulary includes
language modeling information but not command
sentences. The CPU then recognizes (step 142) the user's
speech using one or more command vocabularies 152
including only command words, phrases, and sentenc- 
es (and corresponding speech models and groups). 
Each recognition (steps 140 and 142) assigns a score
to its result based on how closely the user's
speech matches the speech models corresponding to 
the word or words recognized. The scores of both rec- 
ognitions are then compared and the CPU determines 
(step 144) whether the user's speech was a command 
or text. 

For command recognition, the CPU compares the 
initial speech frames to only a first group of possible 
speech models representing the first words of com- 
mands and does not compare the initial speech frames 
to every speech model in the command vocabularies. 
As an example, the initial speech frames are not compared
to the direction variables, "Up", "Down", "Left",
"Right", "Back", and "Forward." Limiting the number of 
speech models to which the speech frames are com- 
pared reduces the time for the comparison and increas- 
es the accuracy of the command recognition. 

Referring also to Fig. 12, for continuous speech
recognition, the recognizer engine begins in state 1 and
waits 200 until the user begins speaking. As the user
begins to speak, the CPU recognizes the beginning of
the user's first word and pre-filters the first group of
speech models for those speech models having similar
sounding beginnings, e.g., "Select", "Center", "Single
Space", "Set Font", "Set Size." The pre-filtered speech
models provide a possible command sentence list of,
for example, twenty possible command sentences that
the user may be speaking.

The recognizer continues by comparing the succes- 
sive speech frames to the pre-filtered speech models 
but not to other speech models (e.g., "Bold"). The
possible command list is ranked in the order of highest to
lowest probability with the command including the
speech model most closely matching the speech frames
being of highest probability (best candidate). As the
CPU continues to compare successive speech frames
to the pre-filtered speech models, the CPU actively
re-ranks the command list if the probabilities change.
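
Pre-filtering and re-ranking can be pictured with a toy candidate list. A real recognizer compares speech frames against acoustic models; here a hand-written table of initial-sound labels stands in for that comparison, and the labels, probabilities, and function names are all hypothetical.

```python
# Hypothetical initial-sound labels for the first words of commands.
INITIAL_SOUND = {
    "Select": "s",
    "Center": "s",
    "Single Space": "s",
    "Set Font": "s",
    "Set Size": "s",
    "Bold": "b",
    "Copy": "k",
    "Capitalize": "k",
    "Quote": "k",
}


def prefilter(heard_sound, commands=INITIAL_SOUND):
    """Keep only commands whose beginning sounds like what was heard."""
    return [name for name, sound in commands.items() if sound == heard_sound]


def rank(scores):
    """Order candidates from highest to lowest probability."""
    return sorted(scores, key=scores.get, reverse=True)
```

Calling `rank` again after the scores are updated from later speech frames corresponds to the active re-ranking of the command list.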

If the CPU determines that the speech frames
substantially match speech models for one or more first
words in one or more commands, the CPU uses the
pointer in each command to the next command
component (i.e., second word) to begin comparing the
successive speech frames to groups of speech models
representing possible second words. For example, if the
speech recognition engine recognizes the word Copy
202 as one of the twenty possible first words spoken,
then the speech recognizer engine uses the references
in the <Action> command sentences 78 (Fig. 7) to begin
comparing (state 2) the successive speech frames to
the speech models representing the words in the <Next
or Previous> group 204, including, "Previous", "Last",
"Back", "Next", and "Forward", the fixed word
"Selection" 206, the <2 to 20> group 208, and the <Text
Objects> group 210. The speech recognizer may also
identify the beginning of the second word to pre-filter the
speech models representing possible second command
sentence words.

Because some words take longer to speak than others,
the speech recognition engine simultaneously
continues to compare the successive speech frames to
longer pre-filtered speech models. Thus, as the speech
recognizer compares (state 2) the successive speech
frames to groups of speech models representing possible
second words in the command sentences starting
with the word "Copy", the speech recognizer continues
to compare the successive speech frames to longer
speech models representing the words "Capitalize" 212
and "Quote" 214. The continued comparison may cause
the CPU to list one of these other possibilities as a higher
probability than "Copy" 202 followed by a second
command sentence word.

The command sentences are similar in grammar 
and limited in number to reduce the amount of user
confusion and to permit the user to easily memorize the
possible commands. The variables (e.g., <Action>, <Style>,
<Next or Prev>) in the command sentences provide the 
user with a wide variety of commands without introduc- 
ing a large number of individual command sentences. 

Additional command sentences may be generated
for other types of applications. For example, Fig. 13
displays possible command sentences for spreadsheet
applications (e.g., Lotus 1-2-3™ and Excel™). The
command sentence templates are generic across
spreadsheet applications. However, the keystrokes from the
command recognizer software to the application are
application dependent (i.e., the keystrokes required by
Lotus 1-2-3™ may be different from the keystrokes
required by Excel™).

The CPU favors dictated text over similarly scored 
commands because it is easier for the user to delete 
misrecognized commands that are typed into a docu- 
ment than it is for a user to undo text that is misrecog- 
nized and executed as a command. For example, if the 
user dictates "belated fall flies," and the system recog- 
nizes the text "belated fall flies" and the command "de- 
lete all files", it is easier for the user to delete the typed 
command "delete all files" than it is for the user to regain
all deleted files. 

In favoring text, the system first determines if there 
was a command reject. Command rejects include noise 
picked up by the system microphone. The speech 
frames may be identified as noise if they match speech 
models corresponding to background noise, telephone 
ringing, or other common noises, or the user utterance 
may be considered a command reject if the command 
recognition scores below an empirically tuned thresh- 
old. The user may be given the ability to vary the thresh- 
old to provide the user with some control over the pre- 
cision of spoken commands. Other command rejects in- 
clude insufficient volume or excessive volume, hard- 
ware errors, or buffer overflow errors. Several command 
rejects may also be considered text rejects as well. 

In favoring text, the system next determines if the 
user's speech conformed to a command template. Us- 
er's speech that does not conform to a command tem- 
plate does not provide a valid recognized command. A 
user's speech does not conform to a template if the user 
does not speak permitted words in the predetermined 
order or if the user inserts pauses between the words of 
a command. For example, if the user says "bold last 3 
(pause) words", the words "bold last 3" are considered 
one utterance while the word "words" is considered an- 
other utterance. Neither utterance conforms to a com- 
mand template, and, thus, neither utterance provides a 
valid recognized command result. 
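
The effect of a pause inside a command can be sketched by splitting timed words into utterances. The "bold last 3 (pause) words" example and the 0.5 second pause length come from the description. The word timestamps, the regular-expression template, and the function name are hypothetical, and the described system detects the pause from silent speech frames rather than from word times.

```python
import re

# Hypothetical template for commands of the form "bold last <n> words".
BOLD_TEMPLATE = re.compile(r"bold last \d+ words")


def split_on_pauses(timed_words, max_pause_s=0.5):
    """Split (word, start_s, end_s) tuples into utterances at long pauses."""
    utterances, current, prev_end = [], [], None
    for word, start, end in timed_words:
        if prev_end is not None and start - prev_end > max_pause_s:
            utterances.append(current)
            current = []
        current.append(word)
        prev_end = end
    if current:
        utterances.append(current)
    return [" ".join(words) for words in utterances]
```

With a long gap before "words", the split yields "bold last 3" and "words", and neither piece fully matches the template, so neither is accepted as a command.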

The system also favors text by comparing the rec- 
ognition score of the command against the recognition 
score of the text. If the text score is higher, then the sys- 
tem recognizes the user's speech as text. If the text and 
command scores are equal or the command score is 
within an empirically tuned range of the text score, then 
the system favors text by recognizing the user's speech 
as text. For example, the empirically tuned range may
be determined by multiplying the number of words in the 
utterance by an empirically determined number, e.g., 
100. However, because the commands are stilted or
unnatural, the recognition score of a correctly spoken
command will generally greatly outscore the text
recognition.
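
The score comparison can be sketched as a single function. The rule that text wins on a tie or when the command score falls within a range of the text score, and the example range of 100 per word, come from the description. The sketch assumes that a higher score means a closer match, and the function and parameter names are hypothetical. The earlier checks for command rejects and template conformance are not modeled.

```python
def choose_text_or_command(text_score, command_score, n_words, per_word_margin=100):
    """Favor text unless the command score clearly exceeds the text score.

    Assumption: a higher score means a closer match.
    """
    margin = n_words * per_word_margin  # empirically tuned range
    if command_score > text_score + margin:
        return "command"
    return "text"
```

For a three-word utterance the command must beat the text score by more than 300 to be executed, which reflects the stated preference for a mistyped command over a mistakenly executed one.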

Dictated text is not favored where the user cannot 
be dictating text. For instance, if the user has pulled 
down a window menu, then a user's spoken utterance 
can only be a command and thus, command recognition 
is favored or only command recognition is executed. 

If the CPU determines that the user's speech is text, 
the CPU sends (step 146) keystrokes or scripting lan- 
guage representing the recognized words to the appli- 
cation that called the speech recognition software. If the 
CPU determines that the user's speech is a command, 
the CPU sends (step 148) keystrokes or scripting lan- 
guage commands to the application to cause the appli- 
cation to execute the command. 

Recognizing dictated text takes longer (e.g., 1.2 times
real time) than recognizing commands (e.g., 0.1 times real
time). One reason for the increase in time is that the
dictated text vocabulary is much larger than the com- 
mand vocabulary. Recognizing the dictated text before 
recognizing commands takes advantage of the speak- 
ing time required by the user. 

Because the dictated text and command vocabular- 
ies are separate, they may be optimized for their respec- 
tive purposes without reducing the accuracy of either 
recognition. As discussed, the dictated text vocabulary 
may include language modeling information. Similarly, 
the command vocabulary may include modeling infor- 
mation that optimizes command recognition. For in- 
stance, the word "sentence" may have a higher proba- 
bility (i.e., receive a higher score) than the word "char- 
acter". 

Another alternative is parallel speech recognition of 
both dictated text and commands. Referring to Fig. 14, 
the CPU simultaneously recognizes (steps 160 and 
162) dictated text and commands by simultaneously 
comparing the speech frames of the user utterance 
against one or more dictated text vocabularies 150 and 
one or more command vocabularies 152. The CPU then
compares (step 164) the results of both recognitions and
determines (step 166) if the user utterance is a com- 
mand or text. Again, the CPU favors text recognition 
over command recognition. If the CPU determines that 
the user utterance is a command, then the CPU sends 
(step 168) keystrokes or scripting language to the appli- 
cation that called the speech recognition software to 
cause the application to execute the recognized com- 
mand. If the CPU determines that the user utterance is 
dictated text, then the CPU sends (step 170) keystrokes
or scripting language to the application to cause the ap- 
plication to type the recognized text. 
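
The parallel arrangement of Fig. 14 can be sketched as below. The recognizers are hypothetical stand-ins that return a word list and a score, the higher-is-better score convention and the per-word constant are assumptions, and Python threads are used only to show the structure of the two simultaneous recognitions, not to suggest how the actual system schedules them.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize_in_parallel(frames, text_recognizer, command_recognizer,
                          per_word_margin=100):
    """Run both recognizers on the same frames, then favor text."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Steps 160 and 162: both recognitions see the same speech frames.
        text_future = pool.submit(text_recognizer, frames)
        command_future = pool.submit(command_recognizer, frames)
        text_words, text_score = text_future.result()
        command_words, command_score = command_future.result()

    # Steps 164 and 166: compare the results, favoring text.
    margin = len(text_words) * per_word_margin
    if command_words and command_score > text_score + margin:
        return ("command", command_words)  # step 168
    return ("text", text_words)            # step 170


# Hypothetical stand-in recognizers, for illustration only.
def fake_text_recognizer(frames):
    return (["underlying", "lasting", "words"], 900)


def fake_command_recognizer(frames):
    return (["underline", "last", "three", "words"], 1500)


if __name__ == "__main__":
    print(recognize_in_parallel([], fake_text_recognizer, fake_command_recognizer))
```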

If the first word of a user utterance is recognized as
a first word of a command sentence, then the CPU may
stop the dictated text recognition and complete only the
command recognition. Similarly, if the first word of a user
utterance is not recognized as a first word of a command
sentence, then the CPU may stop the command recognition
and complete only the dictated text recognition.
Additional speech recognition optimizations and
optimizations for distinguishing text from commands are also
possible.
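
A minimal sketch of the first-word optimization follows. The set of command-initial words is a hypothetical sample drawn from the example command sentences, and the function name is an assumption; a real system would derive the set from its command templates.

```python
# Hypothetical sample of words that can begin a command sentence.
COMMAND_FIRST_WORDS = {"move", "select", "delete", "cut", "copy",
                       "bold", "underline", "print"}


def recognizers_to_continue(first_word):
    """Decide which of the two parallel recognitions to keep running."""
    if first_word.lower() in COMMAND_FIRST_WORDS:
        # Stop dictated text recognition; finish only command recognition.
        return {"command"}
    # Stop command recognition; finish only dictated text recognition.
    return {"text"}


if __name__ == "__main__":
    print(recognizers_to_continue("Underline"))
    print(recognizers_to_continue("This"))
```

Because dictated text can also begin with a word such as "move", this pruning is optional, which is consistent with the description's statement that the CPU "may" stop one of the recognitions.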



Referring to Figs. 15a-15c, if the speech recognition
system incorrectly recognizes a spoken command 310
as dictated text, the user may cause (by keystroke,
mouse selection, or spoken command, e.g., "That was
a command" 312, Fig. 15b) the CPU to re-execute the
speech recognition software. The CPU then re-recognizes
the user's previous utterance and generates keystrokes
or scripting language commands to cause the
application that called the speech recognition software
to delete the previously typed text (Fig. 15c). Where a
separate command vocabulary is available, the
re-recognition is executed only against this vocabulary to
increase the likelihood that the spoken command is
correctly recognized. If the re-recognition provides a
command result with a score that exceeds the empirically
tuned threshold, then the CPU generates keystrokes or
scripting language commands that cause the application
to execute the command (e.g., underlined text 314,
Fig. 15c).

Referring to Fig. 15d, if the re-recognition does not 
provide a command result or the score of the command 
result is insufficient, then the CPU displays the re-rec- 
ognized text 310 within a command window 220 on the 
system's display screen. Alternatively, the command
window is displayed each time the user selects re-rec- 
ognition or, for each selection of re-recognition, the user 
determines when the command window is displayed. As 
previously described, there are many reasons why a 
command may be incorrectly recognized. For example, 
if the user does not speak a command in accordance 
with a command template, then the CPU cannot recog- 
nize the user's speech as a command. Similarly, if the 
user's environment is especially noisy or the user 
speaks too quickly or unclearly, then the CPU may not
recognize the user's speech as a command. Displaying 
the re-recognized speech to the user allows the user to 
detect their own mistakes as well as environmental 
problems. This information may also be used to diag- 
nose system problems. 
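
The re-recognition flow of Figs. 15a-15d can be sketched as follows. The function name, the action tuples, and the recognizer interface (returning a hypothesis, or None when there is no command result, together with a score) are assumptions made for illustration.

```python
def rerecognize_as_command(frames, typed_text, command_recognizer, threshold):
    """Return the actions taken after the user says "That was a command"."""
    # Re-recognize the previous utterance against the command vocabulary only.
    hypothesis, score = command_recognizer(frames)

    # Undo the text that was typed when the command was misrecognized.
    actions = [("delete_text", typed_text)]

    if hypothesis is not None and score > threshold:
        # Score exceeds the empirically tuned threshold: execute the command.
        actions.append(("execute_command", hypothesis))
    else:
        # No command result, or insufficient score: show the re-recognized
        # speech in the command window so the user can see what went wrong.
        shown = hypothesis if hypothesis is not None else typed_text
        actions.append(("show_in_command_window", shown))
    return actions


if __name__ == "__main__":
    recognizer = lambda frames: ("underline last three words", 1500)
    print(rerecognize_as_command([], "underlying lasting words", recognizer, 1000))
```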

Recognition of a "dangerous" (i.e., difficult to undo) 
command may also cause the CPU to display the re- 
recognized command within the command window. For 
example, if the CPU recognizes the command "Delete 
all files", before executing this 'dangerous' command, 
the CPU displays the command for the user. The CPU 
may also display low scoring commands for the user. If 
the user agrees that the displayed command in the
command window is the command the user wants to
execute, then the user requests (by keystroke, mouse
selection, or utterance "OK") the execution of the
command. If the user does not agree with the displayed
command or the displayed text does not match a valid
command, then the user may edit the previously spoken
command, for example, by typing the correct command
in the command window or by saying "Edit" followed
again by the intended spoken command. The system
then executes any recognized valid commands or again
displays any recognized speech that does not conform
to a command template.
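
The confirmation step for "dangerous" or low scoring commands can be sketched as below. The set of dangerous commands, the low-score threshold, the function names, and the way a user response is represented are all hypothetical.

```python
# Hypothetical list of difficult-to-undo commands.
DANGEROUS_COMMANDS = {"delete all files"}


def plan_command(command, score, low_score_threshold):
    """Decide whether to execute at once or display the command first."""
    if command.lower() in DANGEROUS_COMMANDS or score < low_score_threshold:
        return "confirm"
    return "execute"


def resolve_confirmation(displayed_command, user_response, valid_commands):
    """Handle the user's reply to a command shown in the command window.

    user_response is "OK" to accept, or an edited command string.
    """
    candidate = displayed_command if user_response == "OK" else user_response
    if candidate.lower() in valid_commands:
        return ("execute", candidate)
    # Not a valid command: display the recognized speech again.
    return ("display", candidate)


if __name__ == "__main__":
    valid = {"delete all files", "delete sentence"}
    print(plan_command("Delete all files", 2000, 1000))
    print(resolve_confirmation("Delete all files", "delete sentence", valid))
```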

To avoid misrecognizing commands, the user may 
notify the system ahead of time that the user is going to 
speak a command. For example, the user may say "Simon
Says" (or another unusual phrase) before speaking
a command or hold down the control key when speaking a
command. When the system recognizes "Simon Says"
it does not type it as text but uses it as a notification that
the next utterance is, or the following words in the same
utterance are, a command. The command notification
may be used to prevent the CPU from choosing recog- 
nized dictated text as the result or to compare the utter- 
ance only to a command vocabulary (where available) 
to further improve speech recognition accuracy. Providing
a command notification is particularly useful when
the user is going to speak a command that the system
regularly misrecognizes as text. For other easily recog- 
nized commands, the user may choose not to provide 
the notification. 
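
One way to handle the spoken command notification is sketched below. The function name, the returned mode labels, and the flag carried between utterances are assumptions; only the phrase "Simon Says" comes from the description.

```python
NOTIFICATION = ("simon", "says")


def interpret(words, expecting_command=False):
    """Return (mode, remaining_words, expecting_command_for_next_utterance).

    mode is "command" (compare only against commands), "either" (normal
    text/command discrimination), or "none" (nothing left to recognize).
    """
    lowered = tuple(w.lower() for w in words)
    if lowered[:len(NOTIFICATION)] == NOTIFICATION:
        rest = words[len(NOTIFICATION):]
        if not rest:
            # The notification stood alone: the next utterance is a command.
            return ("none", [], True)
        # The following words in the same utterance are a command.
        return ("command", rest, False)
    if expecting_command:
        return ("command", words, False)
    return ("either", words, False)


if __name__ == "__main__":
    print(interpret(["Simon", "Says", "underline", "last", "three", "words"]))
    print(interpret(["Simon", "Says"]))
```

The notification words themselves are never returned, which mirrors the statement that the system does not type "Simon Says" as text.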

Instead of notifying the system that the user is going 
to speak a command, the system may be notified that 
the user is going to dictate text. 

Additionally, if the speech recognition system incor- 
rectly recognizes dictated text as a command, the user 
may cause (by keystroke, mouse selection, or spoken 
command, e.g., "Type That") the CPU to re-execute the
speech recognition software. 

Other embodiments are within the scope of the fol- 
lowing claims. 

For example, instead of having a digital signal proc- 
essor (DSP) process the samples corresponding to 
each speech frame to generate a group of parameters 
associated with the analog data signal during each 20 
ms time period, the CPU includes front-end processing 
software that allows the CPU to generate the parame- 
ters. 



Claims 

1. A method for use in recognizing continuous speech
comprising

accepting signals corresponding to interspersed
speech elements including text elements
corresponding to text to be recognized
and command elements corresponding to
commands to be executed,
recognizing the elements, and
acting on the recognized elements in a manner
which depends on whether they represent text
or commands.

2. The method of claim 1 in which the text is acted on 
by providing it to a text processing application. 

3. The method of claim 1 in which the commands are
acted upon by causing an application to perform a 
step. 

4. The method of claim 1 in which the recognizing is
based on natural characteristics of spoken text versus
spoken commands.

5. The method of claim 1 in which the recognizing
includes evaluating the likelihood that a given
element is either a command element or a text
element.

6. The method of claim 1 further including biasing the
recognizing in favor of a given element being text
or a command.

7. The method of claim 6 in which the biasing includes 
determining if a given one of the elements reflects 
a command reject. 

8. The method of claim 6 in which the biasing includes 
determining if a given one of the elements conforms 
to a command template. 

9. The method of claim 6 in which the biasing includes
comparing recognition scores of the given element 
as a command or as text. 

10. The method of claim 6 in which the biasing includes
determining the length of silence between successive
ones of the elements.

11. The method of claim 6 in which the biasing includes
determining whether the actions of the user imply that
a given one of the elements cannot be text.

12. The method of claim 1 in which the recognizing
comprises, in parallel

recognizing the elements as if they were text,
and
recognizing the elements as if they were
commands.

13. The method of claim 12 further comprising temporarily
stopping the recognizing of elements as if they
were text (or commands) upon determining that an
element is a command element (or a text element).

14. The method of claim 1 further comprising displaying
to a user the results of the recognition.

15. The method of claim 14 wherein the results are
partial results.



16. The method of claim 1 further comprising enabling 
the user to cause a re-recognition if the element is 
incorrectly recognized as text or a command. 

17. The method of claim 16 in which

the user may cause a re-recognition if a command
element is recognized as a text element,
and
in response to the re-recognition a text
processing application may undo the inclusion
of the text element in text being processed.

18. The method of claim 1 in which prior to acting on a
recognized command element, information associated
with the command element is displayed to a
user.



19. The method of claim 1 further comprising

accepting from a user a direction to consider
previous or subsequent elements as either text or
commands but not both.

20. Software stored on a medium for use in recognizing
speech comprising

instructions for accepting signals corresponding
to interspersed speech elements including
text elements corresponding to text to be
recognized and command elements corresponding
to commands to be executed,
instructions for recognizing the elements, and
instructions for acting on the recognized elements
in a manner which depends on whether
they represent text or commands.



21. A method for use in recognizing speech comprising

accepting signals corresponding to interspersed
speech elements including text elements
corresponding to text to be recognized
and command elements corresponding to
commands to be executed,
recognizing the elements,
biasing the recognizing in favor of a given
element being text or a command,
acting on the text by providing it to a text
processing application,
acting on the commands by causing an
application to perform a step, and
displaying to a user the results of the
recognition.
[Figure: QUICK REFERENCE CARD FOR WORD PROCESSING SENTENCES, showing examples of sentences alongside the syntax of sentences]

Examples: MOVE UP 1; MOVE RIGHT 4 WORDS; MOVE DOWN 1 LINE; MOVE BACK 17 SENTENCES; MOVE FORWARD 6 PARAGRAPHS
Syntax: MOVE, DIRECTION, number, TEXT OBJECT(S)
DIRECTION: UP, DOWN, LEFT, RIGHT, BACK, FORWARD
TEXT OBJECT(S): CHARACTER(S), WORD(S), LINE(S), SENTENCE(S), PARAGRAPH(S)

Examples: SELECT LINE; DELETE SENTENCE; CUT LAST PARAGRAPH; COPY 3 WORDS; BOLD 5 LINES; UNDERLINE NEXT 3 PARAGRAPHS; CAPITALIZE FORWARD 2 SENTENCES; QUOTE LAST 15 WORDS
Syntax: ACTION, NEXT OR PREVIOUS, 2 TO 20, TEXT OBJECT(S)
ACTION: SELECT, DELETE, CUT, COPY, BOLD, UNDERLINE, ITALICS, NORMAL, CAPITALIZE, UPPERCASE, LOWERCASE, QUOTE
NEXT OR PREVIOUS: PREVIOUS, LAST, BACK, NEXT, FORWARD
TEXT OBJECT (singular): WORD, LINE, SENTENCE, PARAGRAPH, SELECTION
TEXT OBJECTS (plural): CHARACTERS, WORDS, LINES, SENTENCES, PARAGRAPHS

Examples: INDENT PARAGRAPH; BULLET SELECTION; SINGLE SPACE PARAGRAPH; JUSTIFY SELECTION
Syntax: STYLE, PARAGRAPH or SELECTION
STYLE: INDENT, UNINDENT, HANG, BULLET, NUMBER, CENTER, LEFT ALIGN, RIGHT ALIGN, JUSTIFY, SINGLE SPACE, DOUBLE SPACE

Examples: SET FONT COURIER; SET SIZE 28; SET FONT TIMES 22

Examples: BEGINNING OF SENTENCE; TOP OF DOCUMENT; BOTTOM OF SELECTION
Syntax elements shown: TOP, END; LINE, DOCUMENT

Examples: PRINT PAGE 26; PRINT PAGES 3 TO 46; INSERT A 15 BY 6 TABLE; OPEN LAST 4 FILES; GOTO PAGE 43
Syntax element shown: PRINT PAGE(S)

[Figure: display screen showing the text "This is a test of continuous speech recognition. I can dictate text and I can also dictate commands like: underline last three words"]

[Figure: QUICK REFERENCE CARD FOR SPREADSHEET SENTENCES, showing examples of sentences alongside the syntax of sentences]

Examples: SELECT ROW; DELETE COLUMN; CUT LAST ROW; COPY SELECTION
Examples: AVERAGE COLUMN; SELECT 3 ROWS; SUM 5 COLUMNS; FILL 15 ROWS
Syntax elements shown: FUNCTION, 2 TO 20, SELECT
Examples: MOVE TO CELL ALPHA 5; SELECT TO CELL BRAVO 26
Example: PRINT PAGE 27

[Figure: display screen showing the text "This is a test of continuous speech recognition. I can dictate text and I can also dictate commands like: underlying lasting words"]